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Abstract 

Background: As resequencing projects become more prevalent across a larger number of species, accurate variant 
identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic 
studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete 
information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies 
provide high-resolution access to structural variants often inaccessible to shorter reads. 

Results: We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long 
reads (> 10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two 
genomic features in a strain of Escherichia coli with PBHoney and validate them via denovo assembly. PBHoney is 
available for download at http://sourceforge.net/projects/pb-jelly/. 

Conclusions: Implementing two variant-identification approaches that exploit the high mappability of long reads, 
PBHoney is demonstrated as being effective at detecting larger structural variants using whole-genome Pacific 
Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the 
existence of Rac-Phage in isolate; evidence of £ coil's circular genome. 
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Background 

Structural variation results from numerous biological pro- 
cesses and has been implicated in a variety of diseases 
and phenotypes (see for review [1-5]). As resequencing 
projects become more prevalent across a larger num- 
ber of species, accurate variant identification will fur- 
ther elucidate the nature of genetic diversity and become 
increasingly relevant in genomic studies. However, the 
identification of structural variants via DNA sequenc- 
ing is limited by both the incomplete information pro- 
vided by sequencing reads and the nature of the genome 
itself. 

Next-generation sequencing (NGS) technologies gen- 
erate reads ranging from dozens to hundreds of base 
pairs (bp) in length and with relatively low per-base error 
rates. Moreover, many NGS technologies generate sets of 
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coordinated reads whose genomic separation is known 
a priori (e.g., paired-end and mate-pair reads). When 
reads generated from a sample sequence are aligned to a 
reference genome sequence, variation between the sam- 
ple and reference genomes manifests itself as imperfect 
mapping. NGS genomic variation detection methods take 
advantage of different types of imperfect mappings to 
detect different variant types. Variants smaller than the 
read length (traditionally single-nucleotide variants and 
indels) are identified via discordance (i.e., mismatches 
and gaps) between a sample read and the reference 
sequence [6]. Longer, structural variants include copy- 
number variants, inversions, and translocations. Depth- 
of-coverage methods infer copy-number variants from 
regions of non-uniform mapping coverage [7,8]. Split- 
read and paired-end methods both use reads or pairs of 
reads that map non-contiguously to characterize genomic 
rearrangements larger than the read itself [9,10]. 
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Mapping errors caused by genomic variation are dif- 
ficult to distinguish from those introduced by sequenc- 
ing errors and repetitive genomic sequence. Although 
NGS sequencing error rates are relatively low and their 
effects can often be mitigated with increased genomic 
coverage, repetitive sequence still creates mapping ambi- 
guity. Repetitive regions of the genome also exacerbate 
the search for genomic variation because many variants 
occur in these regions [11]. Moreover, NGS methods do 
not always completely characterize large structural vari- 
ants, often failing to provide full base-pair resolution of 
the entire region of interest. Finally, the efficacy of non- 
contiguous NGS methods can vary for different types of 
genomic variants depending on the sample data char- 
acteristics, such as insert-size distribution and coverage 
[12]. 

We can mitigate these limitations by taking advantage 
of continuous long reads generated by the Pacific Bio- 
sciences (PacBio) RS Sequencer. Each such read is a fully 
resolved sequence up to 30,000 bp. Despite a relatively 
high per-base error rate (~15%), PacBio reads lack sys- 
tematic biases and can be mapped with high accuracy 
[13]. In the present work, we describe two methods of 
identifying larger genomic variants via PacBio sequencing: 
interrupted mapping (PBHoney-Tails) and intra-read dis- 
cordance (PBHoney-Spots). While these methods parallel 
NGS methods conceptually, the longer PacBio reads can 
be more accurately mapped and can span larger genomic 
variants. And, unlike previously published approaches 
to finding structural variants with long' reads [14,15], 
PBHoney is designed to handle continuous long reads 
with lower base-accuracy. 

As proof of concept, we applied PBHoney to PacBio 
reads generated from Escherichia coli and identified four 
structural and two genomic features, each of which was 
confirmed via de novo assembly. 

Implementation 

Interrupted long-read mapping 

PacBio RS filtered subreads are first mapped to a reference 
genome with BLASR [13], an alignment tool optimized for 
reads thousands of base pairs long with higher error rates. 
The BLASR output is a SAM alignment [16] that contains 
each reads single best alignment. Any such best alignment 
does not necessarily map each read position to the refer- 
ence: the mapped read can be truncated prior to the 5' 
and 3' ends, creating an interrupted mapping represented 
by soft-clipped (i.e., unmapped) tails. In the present work, 
all tails longer than 200 bp are extracted from the SAM 
alignment and remapped to the reference genome with 
BLASR, which reports each tails best alignment. Thus, 
any mapped read comprises an initial alignment and up to 
two mapped tails, a 5' prolog and a 3' epilog, which when 
taken together compose a piece-alignment. These piece- 



alignments are placed in a new SAM file that contains 
tags for each alignment describing the locations and ori- 
entations of other members of the same piece-alignment. 
The piece-alignments of most filtered subreads comprise 
only an initial alignment, while only a small subset of these 
reads produce both prologs and epilogs. 

We next cluster piece-alignments with similarly mapped 
tails based on location and orientation. Figure 1 describes 
two sets of possible tail locations and orientations. 
Here, we only consider up to two components of piece- 
alignments (i.e., a prolog and initial or an initial and 
an epilog) because piece-alignments with more than two 
components due to structural variation and not low- 
quality sequence are rare. Should a read produce both a 
prolog and epilog, the alignment with the higher mapping 
quality is chosen. 

First, two piece-alignments are candidates for clustering 
if the corresponding component alignments have loca- 
tions that support breakpoints at similar positions by 
beginning and ending at a distance less than a user- 
defined buffer length. Buffer length is set at 200bp for this 
work and by default within the software. Second, to form a 
cluster, the piece-alignments must share the same internal 
orientation of component alignments. By only clustering 
events that satisfy both conditions, we can distinguish 
multiple variants that may share similar breakpoints, as is 
the case in the Figure 1 example. 

Each final cluster can contain any number of participat- 
ing piece-alignments (e.g., a single read with a mapped 
tail is considered a cluster). Using mapping orientations 
and location, we then annotate each cluster as a deletion, 
insertion, inversion, or translocation and predict break- 
points as the average interrupted position of each read. 
In this study we only report clusters with a minimum of 
three piece-alignments and a minimum average Phred- 
scale mapping quality value of 100. These minima exclude 
piece-alignments that are possibly the result of chimeras 
in the sample preparation and short, non-confidently 
mapped reads. 

Intra-read discordance 

PacBio RS reads have an experimentally determined 15% 
per-base error rate but lack systematic errors such as GC 
bias [17,18]. Because the errors are stochastic, we can 
identify discordant "spots" within the reference where the 
error rate is higher than expected. Using the final SAM 
file (which includes previously unmapped tails), we count 
the number of errors at every position in the reference. At 
any such position, each aligned subread can agree with the 
reference or produce a mismatch, deletion, or insertion. 
Each of these error channels' (mismatches, deletions, and 
insertions) and coverage is calculated and stored in a 4 x G 
integer array (A), where G is length of the reference. To 
identify regions of discordance we convolve the array with 
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Figure 1 Tail Schematic. Schematic of possible tails created by reads representing a deletion (a) and an inversion (c) allele and the structural 
variants they represent when mapped to the reference (b). Rectangles represent double-stranded genomic sequence. Arrows above and below a 
rectangle represent reads mapping to the direct and complement strands, respectively. In these examples, all initial alignments align at the 5' 
breakpoint of the reference. The read spanning a deletion event creates an epilog that maps to the 3' breakpoint on the same strand as its 
corresponding initial alignment. The reads spanning the inversion event breakpoints create prologs that map on the opposite strands of their 
corresponding initial alignment. While all three piece-alignments would cluster if we considered only their location, their orientations support two 
separate events in the reference region. 



several kernels. First, we obtain the error rate at each posi- 
tion by dividing the error channels by the coverage at each 
position: 

Eji = Aji/ Ci, 

where Aji is the value of the ;th channel at position i in 
the reference and Q is the coverage at that position. Next, 
we apply a smoothing kernel that replaces each value in 
the array with the mean channel value over a window of 
length 2B+1 centered at the associated genomic position 
L Formally, 



1 i+B 

Mn = V E jk . 

1 2B+1 V 



k=i-B 

We then obtain the standard deviation and mean for each 
channel across the whole chromosome. Finally, we cal- 
culate changes in discordance on a per-window basis by 
applying a slope kernel: 

1 / i-\ i+B \ 

s » = * £ M * - £ M * • 

\k=i-B k=i+l j 

Each array value now measures the extent to which 
the channel values before each position differ from the 



channel values after. Figure 2 illustrates the signal process- 
ing for a simulated ALU deletion [11]. 

Using the above channels, we identify possible struc- 
tural variants by extracting regions that contain increases 
in discordance (negative Sji values) followed by decreases 
in discordance (positive Sji values), corresponding to the 
starts and ends of genomic variants, respectively. To do so, 
we set discordance thresholds to N times each channels 
standard deviation, where N is a user-defined param- 
eter with an empirically determined default of 5. For 
each channel, we then extract peaks' that sit above these 
thresholds. The widths of these peaks determine the outer 
and inner boundaries for the variant breakpoints. Fur- 
thermore, we predict an exact breakpoint as the point 
of maximum discordance in the outer and inner bound- 
aries. Thus, a possible genomic variant is reported as two 
sets of genomic coordinates, {starts, start, start out ) and 
{endin, end, end out ) and a type determined by the channel 
(insertion, deletion, mismatch). These boundary coordi- 
nates allow us to account for the low base-error and 
realignment issues (such as repeats) that occur near most 
structural variants. 

Results 

We generated DNA from E. coli K12 strain MG1655 and 
created a 17 Kbp mean DNA insert-size using a Blue- 
Pippin preparation protocol (as recommended by Pacific 
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Simulated 327 bp ALU Deletion 
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Figure 2 Simulated ALU Deletion. Plot (a) depicts the raw channels for the 327 bp ALU Deletion. Raw channels include coverage (COV), 
mismatches (MIS), insertions (INS), and deletions (DEL). Plot (b) are the channels after smoothing, and plot (c) is the final signal after applying the 
slope kernel. The gray lines represent the start and end points of the deletion. 



Biosciences). The filtered subreads produced were on 
average 6.1 Kbp long (8.7 Kbp N50) and had a mean 
accuracy of 86.4% when mapped to the E. coli reference 
genome (GenBank accession U00096.2). A total of 95,778 
PacBio RS filtered subreads were generated with an aver- 
age length of 6.1 Kbp and an N50 length of 8.75 Kbp, 
providing an expected 126X average coverage of the 4.6 
Mbp E. coli genome. 

PacBio reads are capable of creating high quality assem- 
blies [19]. Therefore, before detecting variants, we assem- 
bled the PacBio reads to create a sample reference genome 
using the same nonhybrid HGAP assembly techniques 
to independently discover variants. The sample refer- 
ence genome comprised five contigs that uniquely covered 
84.8% of the E. coli reference genome with an N50 of 1.5 
Mbp. We then used MUMmer [20] to identify all variants 
greater than 100 bp between the newly assembled sample 
reference and the standard E. coli reference. This analy- 
sis identified a transposon deletion, a tandem duplication, 
and a tandem deletion. 

After mapping the reads to the E. coli genome and 
processing the alignment through PBHoney, we discov- 
ered evidence of E. colis circular genome, four structural 
variants, and evidence of Rac phage in the E. coli culture. 

Transposon deletion 

PBHoney identified a deletion with breakpoints at coor- 
dinates 1,976,520 and 1,977,300. With a length of less 



than 1,000 bp, this deletion is small enough for some 
PacBio reads to accurately map to the unvarying flanking 
sequence of the deletion in the reference, much in the 
same way that a mapped NGS read can span a small 
indel. While some PacBio reads are not long enough to 
span the deleted sequence, many of these reads' tails 
are long enough to map to the opposite side of the 
deletion. 

Tandem duplication and deletion 

We identifed an insertion of approximately 180 bp 
between the coordinates 1,096,766 and 1,096,817, and 
using the PacBio ALLORA assembly engine (Pacific Bio- 
sciences Menlo Park, CA), we resolved the full insertion 
sequence by assembling the reads that mapped to that 
region. 

To confirm the tandem nature of this insertion, we used 
Tandem Repeats Finder (TRF) [21] to identify 3.4 copies 
of a 181 bp repeat present in that region of the reference 
genome. When applied TRF to the assembly of sample 
reads and 4.4 copies were reported. 

Similarly, we identified a 113 bp deletion in the refer- 
ence between the coordinates 4,294,274 and 4,294,369. 
Applying the same methods, we found 5.3 copies of the 
repeat in the reference genome and 4.3 copies in the 
sample assembly. By remapping this assembly to the ref- 
erence genome, we found the deletion to sit between the 
coordinates 4,294,294 and 4,294,405. 
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P-Element inversion 

The el4 prophage of the E. coli genome contains a 1,828 
bp invertible P-element [22]. While this variant was too 
long to be spanned by mapped reads, we identified a 
subset of reads in the region that map in a manner suggest- 
ing an inversion between the coordinates 1,207,027 and 
1,208,827. 

These coordinates differ from the EcoGene (www. 
ecogene.org) annotated location of the inversion (1,207, 
013 and 1,208,841). This difference is attributable to the 
inverted repeats that flank the P-element and create align- 
ment ambiguity (i.e., aligning query bases to one copy of 
the repeat instead of the other). This ambiguity is over- 
come by shifting bases to a single copy of the repeat, which 
recreates the exact annotated breakpoints. 

Because the P-element inversion only occurs in a subset 
(28 reads in 133x coverage) of the E. coli organisms in a 
given culture, de novo assembly does not expose the event. 
However, our results allow us to easily identify the reads 
that do support the variant. By performing an assembly 
using the subset of reads we recovered the full inverted 
sequence. 

Rac prophage 

In addition to genomic variants inside the E. coli genome, 
we found 8 reads that had interrupted mapping at the 
boundaries of E. colis Rac prophage genomic feature. 
PBHoney annotated this event as a reference genome 
insertion. However, a more complete annotation is that 
these reads are the result of the defective bacteriophage s 
replication and its genome existing in isolate in our 
sample. When we assembled the reads that supported 
this event, we recovered the 25,556 bp circular genome 
sequence of the phage. 

Performance 

To assess how well PBHoney performs with lower cov- 
erage, we ran PBHoney on alignments down-sampled 
to 10X, and 20X coverage fifty times per cover- 
age with default parameters. We repeated this exper- 
iment four times while increasing and decreasing 



independently spots' minimum coverage parameter and 
mimimum standard-deviation threshold parameter. We 
then assessed PBHoneys ability to detect structural vari- 
ants at each run by comparing the detected variants to 
the four known variants and evidence of the circular 
genome (Table 1). It should be noted that with default 
parameters, 21 and 5 false negatives at 10 x and 20 x 
coverage respectively can be attributed to the missing 
P-element inversion. These false negatives are because 
the P-element inversion only occurs in a minority of 
E. coli organisms (~25%) and therefore isn't guaranteed 
to be represented in the down-sampled coverage. If we 
exclude the P-element inversion from our truth set, 10X 
coverage s sensitivity increases to 90.5% and 20x coverage 
to 94.5%. 

To benchmark the computational performance of 
PBHoney, we re-ran our E.coli dataset three times. On 
average, PBHoney spots was able to process the 93k reads 
in 27 minutes with a peak memory usage of 232mb and an 
average usage of 119mb. PBHoney tails processed reads 
on average in 28 seconds with a peak memory usage 
of 177mb and average usage of 90mb. In order to esti- 
mate a maximum resource usage of PBHoney, we tested 
lOx coverage (600k reads) simulated using blasrs alchemy 
program over Human chromosome 1 through spots pro- 
cessing and found a peak memory usage of 10.4gb (2gb 
average) over 3.5 hours of processing time. 

Discussion 

If DNA sequencing technologies could produce a single 
read of chromosome or genome length, variant identifica- 
tion would be a matter of comparing two similar strings. 
Such methodologies are already being applied in com- 
parative genomics structural studies [23] and de novo 
assembly methods of structural variation discovery [24]. 
However, given the computational challenges of whole 
genome de novo assembly, variant identification is lim- 
ited by our ability to accurately map sample reads to the 
reference genome. 

Many factors contribute to whether a read will span a 
variant region or create an interrupted mapping, including 



Table 1 Performance over 50 down-sampling experiments at 1 0X, and 20X coverage 
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237 


236 
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55 
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1 
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3 


46 


FN 


40 


68 


63 


40 


38 


16 


11 


22 


13 


14 


Sensitivity 


84.00% 


72.80% 


74.80% 


86.80% 


84.00% 


93.60% 


95.60% 


91.20% 


94.80% 


94.40% 


PPV 


89.36% 


100.00% 


96.39% 


79.78% 


41.02% 


99.57% 


1 00.00% 


99.56% 


98.75% 


83.69% 



True Positive (TP) False Positive (FP) False Negative (FN) counts, Sensitivity, and Positive Predictive Value (PPV). PPV is the probability any given variant call is true 
(TP/(TP+FP)). Parameters changed are Coverage (c) and Standard-deviation Threshold (e) for spots signal processing. 
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the length and quality of the read, the register of the 
read relative to the variant region, the mappability and 
size of the variant region, and the nature of the align- 
ment algorithm. The length and per-base error rates of 
PacBio reads allow structural variants to 'hide' inside of 
the noise of the stochastic errors. For example, a 500 bp 
deletion in the sample relative to the reference can be 
spanned by a 10 Kbp read because 500 additional errors' 
in the mapping of a 10 Kbp with 1500 expected errors 
can be insufficient to interrupt the mapping. Such hidden 
variants create intra-read discordance and are revealed by 
PBHoney-Spots. If the variant does create an interrupted 
mapping, PBHoney-Tails leverages this information to 
characterize the variant. By incorporating these two dis- 
tinct methods, PBHoney is insensitive to how a read maps 
to the reference, much as are NGS methods that use both 
paired-end and split-read information. PBHoney also lim- 
its itself to categorizing these variants in the context of our 
in silico abstraction of a genomic variant as a local string 
comparison. 

We have also presented an exploration of the parameter 
space by repeating out titration experiments with changes 
in the minimum coverage and minimum standard- 
deviation threshold parameters. Other parameters avail- 
able for the user include the minimum size of a tail to be 
considered for remapping, the minimum number of tailed 
reads needed to support a call, the minimum number of 
unique tailed reads needed to support (i.e. from different 
zero-mode-waveguides - this helps remove false-positives 
that are the result of chimeras), the minimum mapping 
quality of a read and its tail to be included in a variant call, 
and the minimum size of a structural variant. Since the 
most time consuming step in spots processing is count- 
ing the errors at any particular base (2.8 of 3.5 hours in 
the Human chromosome 1 simulation test), an hdf5 file is 
stored containing the arrays necessary to reprocess with 
different parameters should the user wish to tweak his or 
her spots results quickly. 

In the present work, PBHoney reports the breakpoint 
location and a mapping-based classification of each vari- 
ant as one of insertion, deletion, mismatch, inversion, 
and translocation. These results are sufficient to iden- 
tify reads for reassembly, from which the full sequence 
of the event and exact breakpoints can be recovered. 
More complex and biologically informed classification 
thus becomes an independent and subsequent step to 
mapping-based annotation. Samples with more biologi- 
cally complex variants still manifest themselves through 
the methods presented here, and when variants are con- 
sidered in a global context, the complex variation can 
be reconstructed. Future versions of PBHoney will auto- 
mate the assembly process and include more sophisticated 
variant classification that uses existing variant-specific 
tools such as Tandem Repeats Finder and novel haplotype 



reconstruction software to further elucidate the specific 
variant types that occur. 

For this works proof of concept, we processed the hap- 
loid E. coli genome and therefore did not include geno- 
typing information in our calls. However, estimates of 
genotype can currently be established by looking at the 
coverage of reads that support an alternate allele ver- 
sus supporting the reference. Future work will include 
automating this procedure. 

Conclusions 

Genomic variation detection faces many challenges 
when creating a completely characterized genome with 
identified large and complex variants. This work describes 
PBHoney, which leverages the high mappability of long 
reads to identify structural variants in a manner similar to 
the split-read and paired-end methods applied to shorter 
reads. The first continuous long-read specific structural 
variant software, PBHoney should prove valuable to rese- 
quencing efforts, particularly with regions inaccessible 
to short-read read mapping, specifically genomic regions 
subject to repetitive elements that are known to enrich for 
large variation events. 

Availability and requirements 

Project name: PBHoney 

Project home page: http://sourceforge.net/projects/pb- 
jelly 

Operating system(s): Platform independent 
Programming language: Python 

Other requirements: Python 2.7, samtools 0.1.17, blasr 
1.3.1, h5py 2.0.1, pysam 0.7.4, numpy 1.6 
License: FreeBSD. 
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